ABSTRACT

The DIGIT (Database of ImmunoGlobulins with 
Integrated Tools) database (http://biocomputing.it/ 
digit) is an integrated resource storing sequences 
of annotated immunoglobulin variable domains and 
enriched with tools for searching and analyzing 
them. The annotations in the database include infor- 
mation on the type of antigen, the respective germ- 
line sequences and on pairing information between 
light and heavy chains. Other annotations, such as 
the identification of the complementarity deter- 
mining regions, assignment of their structural class 
and identification of mutations with respect to the 
germline, are computed on the fly and can also be 
obtained for user-submitted sequences. The system 
allows customized BLAST searches and automatic 
building of 3D models of the domains to be 
performed. 

INTRODUCTION 

Successful recognition of foreign antigens by antibodies (or 
immunoglobulins) is crucial for the defense of an organism 
against pathogens and strictly depends upon the enor- 
mous diversity of the sequences and structures of these 
molecules. At the same time, these molecules play an ex- 
ceptionally important role in diagnosis, therapy and bio- 
technology applications. 

The effective usage of antibodies in all these applica- 
tions demands knowledge and understanding of their 
sequence and structural properties in order to study the 
molecular basis of their specificity, their 'evolutionary' 
history within the organism and to be able to modify 
them as in humanization experiments or in the design of 
combinatorial libraries.

There are several resources aimed at providing an 
integrated view of the sequences and structures of anti- 
bodies, each with advantages and disadvantages. 



The most renowned one is the Kabat database (1), 
which has been the textbook (and originally was indeed 
released as such) for immunologists. Unfortunately, this is 
now only available at a cost and is not regularly updated. 
The Abysis portal (2) provides some of the needed services,
such as the possibility of querying the database by
accession number, antigen, author name, reference, year of
first publication, chain type (light or heavy or both),
species, etc., but is limited to amino acid sequences and
cannot be used for nucleotide sequences. The Vbase2
database (3) is limited to human and mouse germline
sequences and, most importantly, has not been updated since
2006. IMGT (4) is a database of fully annotated sequences
of immunoglobulins and T-cell receptors from human and
other vertebrates (150 species). It neither provides
sequence-searching tools for amino acid sequences nor
includes information on light and heavy-chain pairing of
the entries.

To overcome some of the shortcomings of the systems
described above and the problems that we ourselves faced
when analyzing real life cases (5-9), we took advantage of
our long-lasting experience in immunoglobulin sequence
and structure analysis and structural prediction (8,10-19)
and developed the DIGIT (Database of ImmunoGlobulins
with Integrated Tools) system.

The annotations in our database include information on
the type of antigen, the respective germline sequences and
on pairing information between light and heavy chains.

The user can query the database using the antigen type, 
source organism, accession number, chain type (heavy, 
lambda and kappa) or free text (disease, process, etc.) 
with the option of selecting only complete immunoglobu- 
lins (i.e. cases where both the correctly paired light and 
heavy-chain sequences are available). 

Other annotations are computed on the fly (and there- 
fore can also be obtained for user-submitted sequences), 
for example: 

(1) numbering of the sequence according to the
Kabat-Chothia numbering scheme (20);

(2) identification of the complementarity determining
regions (CDRs) in the sequence and of the framework
regions;

(3) assignment of the canonical structures for the CDRs
(21);

(4) identification of mutations with respect to the
germline;

(5) automatic link to our 3D modeling tool for
immunoglobulin variable domains (14); and

(6) sequence searching that, given the input
immunoglobulin sequence of interest (amino acid or
nucleotide sequence of the heavy-chain variable domain,
of the light-chain variable domain or of both),
retrieves the closest sequences (sorted according to
the E-value or percentage of sequence identity).

We believe that this is a much-needed resource, as the
information that it contains is either absent from any
other database or can only be obtained by browsing
several sites, most of which are not regularly updated. We
are convinced that DIGIT will be extremely useful to
researchers interested in immunology as well as to
scientists performing experiments such as antibody
humanization, stabilization and functionalization.

IMMUNOGLOBULIN VARIABLE DOMAIN 
STRUCTURE AND NOMENCLATURE 

Immunoglobulins are glycoproteins specifically binding to
one or a few closely related antigens. All immunoglobulins
have a four-chain structure as their basic unit. They are
composed of two identical light chains (L) and two
identical heavy chains (H) held together by inter-chain
disulfide bonds and by non-covalent interactions. Two
domains, a variable and a constant one, form the light
chain, while one variable domain and three constant
domains usually form the heavy chain. Most of the
diversity of the variable domains resides in three regions
from each chain, called the hypervariable regions or CDRs.
These are named according to the chain they belong to and
the order they appear in the sequence (L1, L2, L3, H1, H2
and H3). The regions between the CDRs in the variable
region are called the framework regions (FW).
Immunoglobulin light chains are classified as kappa or
lambda according to their serological and sequence
properties.

Immunoglobulin sequences are usually numbered ac- 
cording to a common scheme (Kabat-Chothia) aimed at 
assigning the same number to topologically equivalent 
residues (20). This is a widely adopted standard for num- 
bering the residues of antibodies in a consistent manner. 

The relationship between the amino acid sequences of
immunoglobulins and the 3D structures of their antigen
binding sites has been extensively studied, leading to the
identification of relatively few residues that, through their
packing, hydrogen bonding or the ability to assume
unusual main-chain conformations, are primarily
responsible for the main-chain conformations of the
hypervariable regions (17,18,21). The commonly occurring
main-chain conformations of the hypervariable regions
are called 'canonical structures'. The canonical structure
definitions can be effectively used to predict the structure
of immunoglobulin variable domains (12,14).



DIGIT SYSTEM OVERVIEW 

DIGIT consists of four main modules: 

(1) the DB search tool allows the user to retrieve entries 
by chain type (heavy, lambda and kappa), source 
organism, antigen, NCBI accession number and 
free text search; 

(2) the Browse tool can be used for inspecting the 
content of the database; 

(3) the Sequence Search tool performs BLAST searches of
the database using a user-provided nucleotide or amino
acid sequence of the light or heavy chain, or both, as
queries; and

(4) the analysis toolbox provides, for a given
immunoglobulin sequence:

(a) the Kabat-Chothia numbering of the chain;

(b) the canonical structures of each of the CDRs;

(c) the mutations with respect to the corresponding
germline sequence; and

(d) the construction of a 3D model of the molecule
through our PIGS (Prediction of Immunoglobulin
Structure) modeling server (14) when both light and
heavy chains have been retrieved or uploaded for an
entry.

There are provisions for saving and later retrieving the 
results as well as for filtering the output files according to 
user-specified keywords. 

The database is regularly updated every 90 days. 

An overview of the system is schematically shown in 
Figure 1. 



DATA SOURCES 

The database presently contains 145 759 heavy-chain
sequences and 71 404 light-chain sequences (47 168 kappa
type and 24 236 lambda type) retrieved using
isotype-specific HMM profiles developed by us, with assigned
canonical structures for the hypervariable loops and
data on the type of antigen as well as the pairing
information of immunoglobulin heavy and light chains
(9672 total pairs).

Sequences were retrieved from the NCBI database (22)
(June 22, 2011) using as query (immunoglobulin OR
immunoglobulins OR antibodies OR antibody OR IG
OR Ab OR heavy OR light OR Fab OR FV).

Light and heavy-chain sequences as well as the type of
light chain (lambda or kappa) were identified by
comparing them with purpose-built HMMs
(available from the DIGIT web site).

Figure 1. Schematic view of the options provided by DIGIT. After selecting an entry obtained by either browsing or searching the database or
directly submitted, the user can retrieve the Kabat-Chothia numbering of the sequence, the canonical structures of the CDRs, the mutations with
respect to the germline and the 3D model of the molecule. Note that the input can be either a nucleotide or an amino acid sequence. The system
provides the possibility of printing and saving the results as well as of aligning the sequences of the displayed entries.

Pairing between light and heavy chains, information that
is not reported in sequence databases, was obtained by
identifying heavy and light-chain sequences reported by the
same author and referring to the same publication, either
if the latter contained only one pair of light and
heavy-chain sequences or if the NCBI description field for
both chains reported an unambiguous identifier after the
keywords 'clone', 'sample' or 'isolate'.
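The two pairing rules described above can be illustrated with a minimal Python sketch. The record layout, the use of a single publication key to stand in for "same author and same publication", and the regular expression are all assumptions made for illustration; they are not taken from the DIGIT implementation.

```python
import re
from collections import defaultdict

# Hypothetical record layout: (accession, chain, publication_key, description).
# The pattern below is an illustrative guess at "identifier after the keyword".
ID_PATTERN = re.compile(r"\b(?:clone|sample|isolate)\s+([\w.-]+)", re.IGNORECASE)


def extract_identifier(description):
    """Return the token following 'clone', 'sample' or 'isolate', if any."""
    match = ID_PATTERN.search(description)
    return match.group(1).lower() if match else None


def pair_chains(records):
    """Pair heavy (H) and light (L) chains reported in the same publication."""
    by_publication = defaultdict(list)
    for record in records:
        by_publication[record[2]].append(record)

    pairs = []
    for entries in by_publication.values():
        heavy = [r for r in entries if r[1] == "H"]
        light = [r for r in entries if r[1] == "L"]
        # Rule 1: the publication reports exactly one heavy and one light chain.
        if len(heavy) == 1 and len(light) == 1:
            pairs.append((heavy[0][0], light[0][0]))
            continue
        # Rule 2: both descriptions carry the same unambiguous identifier.
        heavy_ids = defaultdict(list)
        light_ids = defaultdict(list)
        for r in heavy:
            heavy_ids[extract_identifier(r[3])].append(r[0])
        for r in light:
            light_ids[extract_identifier(r[3])].append(r[0])
        for ident, heavy_acc in heavy_ids.items():
            light_acc = light_ids.get(ident, [])
            if ident is not None and len(heavy_acc) == 1 and len(light_acc) == 1:
                pairs.append((heavy_acc[0], light_acc[0]))
    return pairs


# Toy records with made-up accessions and descriptions.
records = [
    ("hv1", "H", "pub1", "anti-lysozyme antibody heavy chain"),
    ("lv1", "L", "pub1", "anti-lysozyme antibody light chain"),
    ("hv2", "H", "pub2", "Ig heavy chain, clone A7"),
    ("lv2", "L", "pub2", "Ig kappa light chain, clone A7"),
    ("hv3", "H", "pub2", "Ig heavy chain, clone B2"),
]
pairs = pair_chains(records)
print(pairs)
```

In this toy input the heavy chain of clone B2 stays unpaired because no light chain with the same identifier is reported, which mirrors the conservative behavior implied by the requirement of an unambiguous identifier.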

The name of the antigen was retrieved by searching the
NCBI description field for the words following the 'anti'
term. The type of antigen was attributed using a
purpose-built vocabulary.
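As a rough illustration of this step, the sketch below pulls the words following 'anti' out of a description string and maps them to an antigen type. The stop words that end the antigen name and the three-entry vocabulary are hypothetical; the vocabulary actually used by DIGIT is not reproduced in the text.

```python
import re

# Assumed rule: the antigen name runs from "anti-" (or "anti ") up to the next
# antibody-related keyword, punctuation mark or end of the description.
ANTI_PATTERN = re.compile(
    r"\banti[- ](.+?)"
    r"(?=\s+(?:antibody|immunoglobulin|ig|heavy|light|kappa|lambda|fab|scfv)\b"
    r"|[,;(]|$)",
    re.IGNORECASE,
)

# Hypothetical vocabulary mapping antigen names to antigen types.
ANTIGEN_TYPES = {
    "lysozyme": "protein",
    "digoxin": "hapten",
    "dextran": "carbohydrate",
}


def extract_antigen(description):
    """Return the words following 'anti', or None if the term is absent."""
    match = ANTI_PATTERN.search(description)
    return match.group(1).strip() if match else None


def antigen_type(name):
    """Look up the antigen type in the vocabulary; None if unknown."""
    return ANTIGEN_TYPES.get(name.lower()) if name else None


name = extract_antigen("anti-lysozyme antibody heavy chain variable region")
print(name, antigen_type(name))
```

Descriptions without an 'anti' term simply yield no antigen, which is one way ambiguous description fields could lead to missing or incorrect assignments.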

A manual analysis of the automatic assignment on a few
hundred immunoglobulins stored in the database showed
that only a handful of assignments were incorrect, and
these were mainly due to ambiguous description fields in
the NCBI entry.



DB SEARCH TOOL 

The database can be queried by chain type (heavy, lambda, 
kappa), source organism, antigen, accession number or 
through a free text search. 

The output of the DB search operation includes the 
corresponding NCBI identifier, a description of the anti- 
gen, the organism source and the reference to the original 
article. 

Fields are linked to the corresponding NCBI entry, to
the PUBMED record for the article and to a summary
page reporting the genomic locus, the NCBI definition
of the entry, the organism, the reference to the original
article, the sequences of the light and heavy chain, the
isotype of the light chain (kappa or lambda), the
Kabat-Chothia numbering and, if available, the antigen and
the antigen type (protein, peptide, carbohydrate, hapten and
small molecule).

A toggle button can be used to select an entry for 
further analysis (see 'Analysis Tools' section). 

BROWSE TOOL 

This is a simple interface for browsing the content of the 
database and selecting an entry to analyze. The user can 
select the chain type or (optionally) the progressive number 
of the entry from which to start. The output page contains
a list of the first 25 retrieved entries and buttons to
move to the next or previous 25 entries.

SEQUENCE SEARCH TOOL 

The user can provide the amino acid or nucleotide
sequence(s) of a light or heavy chain or both.

The system performs one or more BLAST searches with
default parameters. If the input consists of a light and
heavy chain, the database is searched with the light chain,
with the heavy chain and with both. The latter result is
useful, for example, for humanization experiments or for
selecting a modeling template.

In both cases, for example, it might be convenient to
select the chains of the same immunoglobulin as templates
for both the light and heavy chain rather than chains
coming from different antibodies, since even minor
differences in the interface can lead to a different packing
geometry and affect the topology of the antigen binding
site (11).

In the description below, we will assume that the user 
wishes to search the database with a complete variable 
domain including both the light and heavy chains. The 
only difference from the case when only one of the two 
chains is used as input is that, obviously, only the results 
for the selected chain are available in this case. 



ANALYSIS TOOLS 

Upon completion of the search, the user can access several 
pages using tabs labeled 'Blast L', 'Blast H', 'Blast L+H', 
'Kabat numbering', 'Canonical structures', 'Features', 
'Mutations' and '3D model' (Figure 2). 

The first three options provide the BLAST results for the
light and heavy chain and those obtained by
concatenating the two chains and searching a dataset
where paired light and heavy-chain sequences are
concatenated as well.
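A minimal sketch of how such a concatenated dataset could be prepared is given below. The light-then-heavy order, the FASTA layout and the truncated toy sequences are assumptions for illustration only; the text does not specify how DIGIT builds its paired dataset.

```python
def concatenate_pair(light, heavy):
    """Join a light and a heavy variable-domain sequence into one sequence."""
    # Assumed order: light chain first, then heavy chain.
    return light + heavy


def paired_fasta(pairs):
    """Format (pair_id, light, heavy) tuples as FASTA records."""
    lines = []
    for pair_id, light, heavy in pairs:
        lines.append(">" + pair_id)
        lines.append(concatenate_pair(light, heavy))
    return "\n".join(lines) + "\n"


# Toy, truncated sequences; real variable domains are roughly 110-120 residues.
fasta = paired_fasta([("pair1", "DIQMTQSPSS", "EVQLVESGGG")])
print(fasta)
```

The query is concatenated in the same way before the search, so that a hit reflects similarity over both chains of a single antibody rather than over two unrelated entries.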

These pages contain the NCBI sequence identifier of 
the retrieved sequence(s), the definition(s), the reference, 
the antigen, the percentage of sequence identity and the 
E-value as reported by BLAST. Each column can be used
for sorting the results and columns can be moved around 
using the mouse. 
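The two most natural sort orders for such a table run in opposite directions: lower E-values and higher identities both indicate closer sequences. The short sketch below, using made-up hits and hypothetical field names, shows the distinction.

```python
from operator import itemgetter

# Made-up hits; the field names are illustrative, not DIGIT's column names.
hits = [
    {"id": "seq_a", "identity": 99.0, "evalue": 1e-50},
    {"id": "seq_b", "identity": 98.1, "evalue": 1e-62},
    {"id": "seq_c", "identity": 85.0, "evalue": 1e-41},
]


def sort_hits(hits, by="evalue"):
    """Sort ascending by E-value, or descending by percentage identity."""
    descending = by == "identity"
    return sorted(hits, key=itemgetter(by), reverse=descending)


by_evalue = [h["id"] for h in sort_hits(hits, "evalue")]
by_identity = [h["id"] for h in sort_hits(hits, "identity")]
print(by_evalue, by_identity)
```

The two orders can differ, as in this toy case, because the E-value also depends on the length of the aligned region.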

The Kabat numbering page reports the alignment of the 
input sequence with the commonly used Kabat-Chothia 
numbering scheme (20). The canonical structures (describ- 
ing the main chain conformation of the CDRs) are also 
reported together with the length of the CDRs. 

The Features page is a summary reporting the se- 
quences of the various regions of the molecule separately 
in the order they appear in the sequence (Framework 1, 
CDR1, Framework 2, CDR2, Framework 3, CDR3 and
Framework 4). 

The 'Mutations' page includes information about the 
mutations with respect to the germline of the selected 
sequence. If the input sequence was provided as a nucleo- 
tide sequence, both the nucleotide and the corresponding 
amino acid mutations are shown. The BLOSUM62 score for
the amino acid variation is also shown below each
mutation.
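The core of such a comparison can be sketched as a position-by-position scan of two pre-aligned sequences. The sketch assumes that the alignment to the germline has already been made, reports plain sequential positions rather than Kabat-Chothia numbers, and omits the BLOSUM62 lookup; the toy sequences are illustrative.

```python
def list_mutations(germline, query):
    """Compare two pre-aligned, equal-length sequences position by position."""
    if len(germline) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    mutations = []
    for position, (g, q) in enumerate(zip(germline, query), start=1):
        # Gap characters mark insertions or deletions, not substitutions.
        if g != q and "-" not in (g, q):
            mutations.append(f"{g}{position}{q}")
    return mutations


# Toy aligned fragments (not real germline data).
germline = "EVQLVESGGG"
query = "EVKLVESGAG"
mutations = list_mutations(germline, query)
print(mutations)
```

Each reported substitution could then be scored by looking up the germline and observed residues in a substitution matrix such as BLOSUM62.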
Figure 2. The output page of the DB search tool. The left panel lists previous jobs the results of which can be retrieved. The current job is enclosed
in a box. The various tabs provide access to the corresponding results. Each of the entries can be used as starting point for a new analysis by clicking
the 'A' button. The results can be printed, saved or filtered according to user-defined keywords. The multiple sequence alignment for all the displayed
entries, obtained by clicking the 'Alignment' button, is displayed on a new page.

Finally, the user can directly obtain a 3D model of the
input sequence through the PIGS immunoglobulin 
modeling tool (14). 

In all cases when a list of sequences is displayed on the
page, it is possible to obtain an alignment of their
sequences in a new window. If both light and heavy chains
are listed on the page (as for example in a Blast 'L+H'
output), the alignment will include both light and
heavy-chain sequences separated by the '//' symbol.

CONCLUSIONS 

Several biomedical and biotechnological projects need to
take advantage of a detailed understanding of
immunoglobulin features. Humanization and combinatorial
library design require an analysis of the properties of
the framework of the molecules and of their similarity
with other antibodies from the same or a different
organism. Immunologically based diagnosis can be helped by
comparing the sequences of different immunoglobulins
as well as by understanding their mutation patterns with
respect to the germline and by inspecting their known or
predicted 3D structures.

In all these cases, the light and heavy chains of the 
antibody cannot be treated as separate entities (as is the 
case in sequence databases). Furthermore, the analyses 
have to be easy to perform within a single site and 
reported in a language and with an organization reflecting 
the flowchart of the design of an experiment. 

We believe that the DIGIT system described here meets
all these requirements and can become a 'one stop shop' 
for the biomedical and biotechnological community inter- 
ested in immunoglobulins. 